@rkistner commented Nov 20, 2025

Background

Currently, the replication process is effectively linear / "single-threaded". When a new sync rules version is deployed, we create a new replication stream, which snapshots each table and then starts streaming. This has a couple of limitations:

  1. Replication is slower than it needs to be, since tables cannot be replicated concurrently.
  2. After the initial table snapshots complete, there can be significant replication lag to catch up on.

The changes here are also part of the bigger project to implement differential sync rule updates - only re-replicating changed bucket definitions / sync stream definitions. Part of that requires switching to a single replication stream shared across all sync rule versions, and this PR builds the base for that.

Changes to storage implementation

First, this changes the underlying BucketStorageBatch implementations to be safe under concurrent usage. The design still assumes a single process doing streaming replication with commits at a time, but multiple other processes can now safely run snapshots concurrently.

To implement this, we reduce reliance on local state in favor of database state, which is safe for concurrent access. This does not introduce any new model fields yet, but we now rely more strongly on snapshot_done (tracks whether we are still waiting for any snapshots to complete) and keepalive_op (tracks ops persisted but not yet committed). One specific implication is that new slots always need an explicit markAllSnapshotDone() call; this is no longer done automatically.

This also removes the implementation difference between keepalive(lsn) and commit(lsn): both now do the same thing, as in the sketch below.
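To make the model concrete, here is a minimal TypeScript sketch of how these pieces could fit together. Only snapshot_done, keepalive_op, markAllSnapshotDone(), keepalive() and commit() are names from this PR; the StateStore interface and its methods are hypothetical stand-ins for the actual storage layer.

```ts
// Hedged sketch of the coordination model, not the real implementation.
interface SyncRulesState {
  snapshot_done: boolean; // false while any table snapshot is still pending
  keepalive_op: bigint | null; // highest op persisted but not yet committed
}

interface StateStore {
  readState(): Promise<SyncRulesState>;
  updateState(update: Partial<SyncRulesState>): Promise<void>;
  createCheckpoint(lsn: string, upToOp: bigint): Promise<void>;
}

class BucketStorageBatchSketch {
  constructor(private db: StateStore) {}

  // Must now be called explicitly once every queued table snapshot has
  // finished; new slots no longer get this automatically.
  async markAllSnapshotDone(): Promise<void> {
    await this.db.updateState({ snapshot_done: true });
  }

  // keepalive(lsn) and commit(lsn) are now equivalent.
  async keepalive(lsn: string): Promise<void> {
    await this.commit(lsn);
  }

  async commit(lsn: string): Promise<void> {
    const state = await this.db.readState();
    if (state.snapshot_done && state.keepalive_op != null) {
      // All snapshots done: publish a consistent checkpoint at this LSN,
      // covering every op persisted so far.
      await this.db.createCheckpoint(lsn, state.keepalive_op);
      await this.db.updateState({ keepalive_op: null });
    }
    // Otherwise ops keep accumulating under keepalive_op until
    // markAllSnapshotDone() unblocks the next commit.
  }
}
```

Keeping this state in the database rather than in process memory is what makes it safe for snapshotting processes and the streaming process to interleave.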

Changes to replication

Starting with Postgres, we now start streaming changes immediately when replication starts, even if a snapshot is still required. To avoid consistency issues (see the sketch after this list), we:

  1. Ignore rows picked up by streaming replication for tables whose snapshot is still in progress - the snapshot itself will capture their latest state.
  2. Implement soft-deletes, so that deletes in the replication stream take priority over rows in the table snapshot [PENDING].
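A rough sketch of these two rules, assuming hypothetical helpers (tableSnapshotComplete, recordSoftDelete, apply) rather than the actual storage API; the soft-delete branch corresponds to the [PENDING] item:

```ts
type Change =
  | { kind: 'insert' | 'update'; table: string; row: Row }
  | { kind: 'delete'; table: string; rowId: string };

interface Row {
  id: string;
  [column: string]: unknown;
}

interface SnapshotAwareStorage {
  tableSnapshotComplete(table: string): Promise<boolean>;
  apply(change: Change): Promise<void>;
  recordSoftDelete(table: string, rowId: string): Promise<void>;
}

async function applyStreamedChange(change: Change, storage: SnapshotAwareStorage) {
  if (await storage.tableSnapshotComplete(change.table)) {
    // Normal path: snapshot finished, apply streamed changes directly.
    await storage.apply(change);
    return;
  }
  if (change.kind === 'delete') {
    // Soft-delete: remember the delete so that a snapshot row arriving
    // later for the same id is discarded instead of resurrecting it.
    await storage.recordSoftDelete(change.table, change.rowId);
  }
  // Inserts/updates for tables still being snapshotted are ignored;
  // the snapshot will pick up the current row state anyway.
}
```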

This also splits the snapshot implementation out from the streaming replication implementation. The snapshotter keeps a queue of tables to snapshot; currently it only snapshots one table at a time, but that can change in the future, as in the sketch below.
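For illustration, a minimal sketch of such a table queue, with snapshotTable() as a hypothetical stand-in for the actual per-table snapshot logic; making the drain loop process several tables at once is where concurrency could be added later:

```ts
class Snapshotter {
  private queue: string[] = [];
  private running = false;

  enqueue(table: string): void {
    this.queue.push(table);
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.running) return; // a single drain loop at a time
    this.running = true;
    try {
      let table: string | undefined;
      while ((table = this.queue.shift()) !== undefined) {
        await snapshotTable(table); // hypothetical: copies the table contents
      }
    } finally {
      this.running = false;
    }
  }
}

declare function snapshotTable(table: string): Promise<void>;
```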

Tasks

  • Apply the same changes to Postgres bucket storage.
  • Re-implement the createEmptyCheckpoints logic.
  • Get all current tests passing.
  • Implement soft deletes, and add tests for that edge case.
  • Add low-level tests for concurrent storage usage.
  • Proper "job management" for the various replication tasks.
  • Docs.
  • MongoDB and MySQL implementations (future PR).

@changeset-bot commented Nov 20, 2025

⚠️ No Changeset found

Latest commit: ac790a7

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
